speech response
SageLM: A Multi-aspect and Explainable Large Language Model for Speech Judgement
Ge, Yuan, Zhang, Junxiang, Liu, Xiaoqian, Li, Bei, Ma, Xiangnan, Wang, Chenglong, Ye, Kaiyang, Du, Yangfan, Zhang, Linfeng, Huang, Yuxin, Xiao, Tong, Yu, Zhengtao, Zhu, JingBo
Speech-to-Speech (S2S) Large Language Models (LLMs) are foundational to natural human-computer interaction, enabling end-to-end spoken dialogue systems. However, evaluating these models remains a fundamental challenge. We propose SageLM, an end-to-end, multi-aspect, and explainable speech LLM for comprehensive evaluation of S2S LLMs. First, unlike cascaded approaches that disregard acoustic features, SageLM jointly assesses both semantic and acoustic dimensions. Second, it leverages rationale-based supervision to enhance explainability and guide model learning, achieving superior alignment with evaluation outcomes compared to rule-based reinforcement learning methods. Third, we introduce SpeechFeedback, a synthetic preference dataset, and employ a two-stage training paradigm to mitigate the scarcity of speech preference data. Trained on both semantic and acoustic dimensions, SageLM achieves an 82.79% agreement rate with human evaluators, outperforming cascaded and SLM-based baselines by at least 7.42% and 26.20%, respectively.
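The headline metric here is agreement rate with human evaluators. A minimal sketch of how such a rate can be computed, assuming each test item carries one human preference label and one judge decision over the same pair of spoken responses (the function and label names are illustrative, not from the SageLM paper):

```python
from typing import Sequence

def agreement_rate(judge_choices: Sequence[str], human_choices: Sequence[str]) -> float:
    """Fraction of items where the LLM judge picks the same winner as the human annotator."""
    assert len(judge_choices) == len(human_choices)
    matches = sum(j == h for j, h in zip(judge_choices, human_choices))
    return matches / len(human_choices)

# Toy usage: three judged pairs, the judge agrees with the human on two of them.
print(agreement_rate(["A", "B", "A"], ["A", "B", "B"]))  # 0.666...
```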
- North America > United States > Florida > Miami-Dade County > Miami (0.04)
- North America > Canada > Ontario > Toronto (0.04)
- Europe > United Kingdom > England > Cambridgeshire > Cambridge (0.04)
- (5 more...)
- Health & Medicine > Consumer Health (1.00)
- Health & Medicine > Therapeutic Area > Psychiatry/Psychology > Mental Health (0.93)
Emotion Omni: Enabling Empathetic Speech Response Generation through Large Language Models
Wang, Haoyu, Zhang, Guangyan, Chen, Jiale, Li, Jingyu, Wang, Yuehai, Guo, Yiwen
With the development of speech large language models (speech LLMs), users can now interact directly with assistants via speech. However, most existing models only convert response content into speech without fully capturing the rich emotional cues in user queries, where the same sentence may convey different meanings depending on how it is expressed. Emotional understanding is thus essential for improving human-machine interaction. Most empathetic speech LLMs rely on massive datasets, which demands high computational cost. A key challenge is to build models that generate empathetic responses with limited data and without large-scale training. To this end, we propose Emotion Omni, a model that understands emotional content in user speech and generates empathetic responses. We further develop a data pipeline to construct a 200k emotional dialogue dataset that supports empathetic speech assistants. Experiments show that Emotion Omni achieves comparable instruction-following ability without large-scale pretraining, while surpassing existing models in speech quality (UTMOS: 4.41) and empathy (Emotion GPT Score: 3.97). These results confirm its improvements in both speech fidelity and emotional expressiveness. Demos are available at https://w311411.github.io/omni_demo/.
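One step of such a data pipeline could look like the sketch below: label the emotion of a user utterance, prompt a text LLM for an empathetic reply, and synthesize expressive speech for it. The `detect_emotion`, `generate_reply`, and `synthesize` callables are hypothetical stand-ins, not the authors' actual pipeline components.

```python
def build_example(user_text: str, detect_emotion, generate_reply, synthesize) -> dict:
    """Construct one empathetic-dialogue training example from a user utterance."""
    emotion = detect_emotion(user_text)                    # e.g. "sad", "excited"
    prompt = (f"The user sounds {emotion}. Reply empathetically and concisely.\n"
              f"User: {user_text}\nAssistant:")
    reply_text = generate_reply(prompt)                    # text-LLM call
    reply_audio = synthesize(reply_text, style=emotion)    # expressive TTS
    return {"user": user_text, "emotion": emotion,
            "assistant_text": reply_text, "assistant_audio": reply_audio}
```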
STITCH: Simultaneous Thinking and Talking with Chunked Reasoning for Spoken Language Models
Chiang, Cheng-Han, Wang, Xiaofei, Li, Linjie, Lin, Chung-Ching, Lin, Kevin, Liu, Shujie, Wang, Zhendong, Yang, Zhengyuan, Lee, Hung-yi, Wang, Lijuan
Spoken Language Models (SLMs) are designed to take speech inputs and produce spoken responses. However, current SLMs lack the ability to perform an internal, unspoken thinking process before responding. In contrast, humans typically engage in complex mental reasoning internally, enabling them to communicate ideas clearly and concisely. Thus, integrating an unspoken thought process into SLMs is highly desirable. While naively generating a complete chain-of-thought (CoT) reasoning before starting to talk can enable thinking for SLMs, this induces additional latency for the speech response, as the CoT reasoning can be arbitrarily long. To solve this issue, we propose Stitch, a novel generation method that alternates between the generation of unspoken reasoning chunks and spoken response chunks. Since the audio duration of a chunk of spoken response is much longer than the time needed to generate its tokens, we use the remaining free time to generate the unspoken reasoning tokens. While a chunk of audio is played to the user, the model continues to generate the next unspoken reasoning chunk, achieving simultaneous thinking and talking. Remarkably, Stitch matches the latency of baselines that cannot generate unspoken CoT by design while outperforming those baselines by 15% on math reasoning datasets; Stitch also performs as well as those baseline models on non-reasoning datasets. Some animations and demonstrations are on the project page: https://d223302.github.io/STITCH.
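The alternation can be pictured with a small scheduling sketch, assuming a chunk-level generator; `generate_chunk`, the chunk tags, and the end-of-sequence marker are illustrative placeholders, not the paper's actual interface.

```python
def respond(question, generate_chunk, max_chunks=8):
    """Alternate unspoken reasoning chunks and spoken chunks, as described above."""
    context, reasoning, spoken = [question], [], []
    for _ in range(max_chunks):
        r = generate_chunk(context, kind="reasoning")  # unspoken CoT chunk (never voiced)
        context.append(r); reasoning.append(r)
        s = generate_chunk(context, kind="spoken")     # chunk the user will actually hear
        context.append(s); spoken.append(s)
        # While the audio of `s` is playing, the next reasoning chunk is generated
        # "for free", so only the first spoken chunk adds latency.
        if s.endswith("<eos>"):
            break
    return reasoning, spoken
```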
- Asia (0.92)
- North America > United States (0.67)
- Information Technology > Artificial Intelligence > Speech (1.00)
- Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
- Information Technology > Artificial Intelligence > Natural Language > Chatbot (1.00)
- Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (1.00)
OmniCharacter: Towards Immersive Role-Playing Agents with Seamless Speech-Language Personality Interaction
Zhang, Haonan, Luo, Run, Liu, Xiong, Wu, Yuchuan, Lin, Ting-En, Zeng, Pengpeng, Qu, Qiang, Fang, Feiteng, Yang, Min, Gao, Lianli, Song, Jingkuan, Huang, Fei, Li, Yongbin
Role-Playing Agents (RPAs), benefiting from large language models, are an emerging class of interactive AI systems that simulate roles or characters with diverse personalities. However, existing methods primarily focus on mimicking dialogues among roles in textual form, neglecting the role's voice traits (e.g., voice style and emotions), which play a crucial role in interaction and make the experience far more immersive in realistic scenarios. To address this gap, we propose OmniCharacter, the first seamless speech-language personality interaction model to achieve immersive RPAs with low latency. Specifically, OmniCharacter enables agents to consistently exhibit role-specific personality and vocal traits throughout the interaction, producing a mixture of speech and language responses. To align the model with speech-language scenarios, we construct a dataset named OmniCharacter-10K, which comprises 20 distinctive characters, 10K richly contextualized multi-round dialogues, and 135K dynamic speech responses. Experimental results showcase that our method yields better responses in terms of both content and style compared to existing RPAs and mainstream speech-language models, with a response latency as low as 289ms. Code and dataset are available at https://github.com/AlibabaResearch/DAMO-ConvAI/tree/main/OmniCharacter.
LLaMA-Omni2: LLM-based Real-time Spoken Chatbot with Autoregressive Streaming Speech Synthesis
Fang, Qingkai, Zhou, Yan, Guo, Shoutao, Zhang, Shaolei, Feng, Yang
Real-time, intelligent, and natural speech interaction is an essential part of next-generation human-computer interaction. Recent advancements have showcased the potential of building intelligent spoken chatbots based on large language models (LLMs). In this paper, we introduce LLaMA-Omni 2, a series of speech language models (SpeechLMs) ranging from 0.5B to 14B parameters, capable of achieving high-quality real-time speech interaction. LLaMA-Omni 2 is built upon the Qwen2.5 series models, integrating a speech encoder and an autoregressive streaming speech decoder. Despite being trained on only 200K multi-turn speech dialogue samples, LLaMA-Omni 2 demonstrates strong performance on several spoken question answering and speech instruction following benchmarks, surpassing previous state-of-the-art SpeechLMs like GLM-4-Voice, which was trained on millions of hours of speech data.
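The described layout (a speech encoder feeding an LLM backbone whose hidden states drive an autoregressive speech decoder) can be sketched as a minimal PyTorch skeleton; the module choices, dimensions, and names below are illustrative stand-ins, not LLaMA-Omni 2's actual implementation.

```python
import torch
import torch.nn as nn

class SpeechLMSkeleton(nn.Module):
    """Speech encoder -> LLM backbone -> streaming speech-unit decoder (illustrative)."""
    def __init__(self, d_model=1024, n_speech_units=4096):
        super().__init__()
        self.speech_encoder = nn.GRU(80, d_model, batch_first=True)            # stand-in for a pretrained encoder
        layer = nn.TransformerEncoderLayer(d_model, nhead=8, batch_first=True)
        self.llm = nn.TransformerEncoder(layer, num_layers=4)                  # stand-in for the LLM backbone
        self.speech_decoder = nn.Linear(d_model, n_speech_units)               # predicts discrete speech units

    def forward(self, mel):                    # mel: (batch, frames, 80) filterbank features
        enc, _ = self.speech_encoder(mel)      # continuous speech representations
        hidden = self.llm(enc)                 # contextualized hidden states
        return self.speech_decoder(hidden)     # unit logits, vocoded to audio chunk by chunk

# Usage: logits = SpeechLMSkeleton()(torch.randn(1, 200, 80))
```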
- Information Technology > Artificial Intelligence > Speech > Speech Recognition (1.00)
- Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
- Information Technology > Artificial Intelligence > Natural Language > Chatbot (1.00)
- Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (1.00)
LUCY: Linguistic Understanding and Control Yielding Early Stage of Her
Gao, Heting, Shao, Hang, Wang, Xiong, Qiu, Chaofan, Shen, Yunhang, Cai, Siqi, Shi, Yuchen, Xu, Zihan, Long, Zuwei, Zhang, Yike, Dong, Shaoqi, Fu, Chaoyou, Li, Ke, Ma, Long, Sun, Xing
The film Her features Samantha, a sophisticated AI audio agent who is capable of understanding both linguistic and paralinguistic information in human speech and delivering real-time responses that are natural, informative and sensitive to emotional subtleties. Moving one step toward such a sophisticated audio agent, building on recent advances in end-to-end (E2E) speech systems, we propose LUCY, an E2E speech model that (1) senses and responds to the user's emotion, (2) delivers responses in a succinct and natural style, and (3) uses external tools to answer real-time inquiries. Experimental results show that LUCY is better at emotion control than peer models, generating emotional responses based on linguistic emotional instructions and responding to paralinguistic emotional cues. LUCY is also able to generate responses in a more natural style, as judged by external language models, without sacrificing much performance on general question answering. Finally, LUCY can leverage function calls to answer questions that are outside its knowledge scope.
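The function-calling behavior for out-of-knowledge questions can be sketched as a simple dispatch loop, assuming the model signals a tool call with a small JSON object; the call format and the `tools` registry are illustrative assumptions, not LUCY's actual interface.

```python
import json

def answer(query, model_generate, tools):
    """Answer directly, or route real-time questions through an external tool."""
    out = model_generate(query)
    if out.strip().startswith("{"):                    # e.g. {"tool": "web_search", "args": {"q": "..."}}
        call = json.loads(out)
        result = tools[call["tool"]](**call["args"])   # run the requested tool
        return model_generate(f"{query}\nTool result: {result}\nAnswer:")
    return out                                         # question was within the model's knowledge
```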
- Leisure & Entertainment (0.48)
- Health & Medicine > Therapeutic Area (0.34)
- Media (0.34)
IntrinsicVoice: Empowering LLMs with Intrinsic Real-time Voice Interaction Abilities
Zhang, Xin, Lyu, Xiang, Du, Zhihao, Chen, Qian, Zhang, Dong, Hu, Hangrui, Tan, Chaohong, Zhao, Tianyu, Wang, Yuxuan, Zhang, Bin, Lu, Heng, Zhou, Yaqian, Qiu, Xipeng
Current methods of building LLMs with voice interaction capabilities rely heavily on explicit text autoregressive generation before or during speech response generation to maintain content quality, which unfortunately brings computational overhead and increases latency in multi-turn interactions. To address this, we introduce IntrinsicVoice, an LLM designed with intrinsic real-time voice interaction capabilities. IntrinsicVoice aims to facilitate the transfer of textual capabilities of pre-trained LLMs to the speech modality by mitigating the modality gap between text and speech. Our novel architecture, GroupFormer, can reduce speech sequences to lengths comparable to text sequences while generating high-quality audio, significantly reducing the length difference between speech and text, speeding up inference, and alleviating long-text modeling issues. Additionally, we construct a multi-turn speech-to-speech dialogue dataset named IntrinsicVoice-500k, which includes nearly 500k turns of speech-to-speech dialogues, and a cross-modality training strategy to enhance the semantic alignment between speech and text. Experimental results demonstrate that IntrinsicVoice can generate high-quality speech responses with latency lower than 100ms in multi-turn dialogue scenarios. Demos are available at https://instrinsicvoice.github.io/.

Large language models (LLMs) (Yang et al., 2024; Dubey et al., 2024; OpenAI, 2023) and multimodal large language models (MLLMs) (Tang et al., 2023; Chu et al., 2024; Liu et al., 2024) have exhibited exceptional performance across a variety of natural language processing tasks and multimodal comprehension tasks, allowing them to become powerful solvers for general tasks.
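The length-reduction idea behind this (packing several consecutive speech tokens into one position so the LLM sees a sequence comparable in length to text) can be sketched as below; the fixed group size and mean pooling are illustrative simplifications, and the paper's GroupFormer uses its own learned grouping module instead.

```python
import torch

def group_speech_tokens(speech_emb: torch.Tensor, group_size: int = 5) -> torch.Tensor:
    """Pack consecutive speech-token embeddings into groups: (B, T, D) -> (B, ceil(T/G), D)."""
    b, t, d = speech_emb.shape
    pad = (-t) % group_size                               # pad so T is divisible by the group size
    if pad:
        speech_emb = torch.cat([speech_emb, speech_emb.new_zeros(b, pad, d)], dim=1)
    return speech_emb.view(b, -1, group_size, d).mean(dim=2)   # one pooled vector per group

# Usage: a 5x shorter sequence for the LLM backbone.
# group_speech_tokens(torch.randn(1, 500, 1024)).shape  ->  torch.Size([1, 100, 1024])
```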
LLaMA-Omni: Seamless Speech Interaction with Large Language Models
Fang, Qingkai, Guo, Shoutao, Zhou, Yan, Ma, Zhengrui, Zhang, Shaolei, Feng, Yang
Models like GPT-4o enable real-time interaction with large language models (LLMs) through speech, significantly enhancing user experience compared to traditional text-based interaction. However, there is still a lack of exploration on how to build speech interaction models based on open-source LLMs. To address this, we propose LLaMA-Omni, a novel model architecture designed for low-latency and high-quality speech interaction with LLMs. It eliminates the need for speech transcription, and can simultaneously generate text and speech responses directly from speech instructions with extremely low latency. We build our model based on the latest Llama-3.1-8B-Instruct. To align the model with speech interaction scenarios, we construct a dataset named InstructS2S-200K, which includes 200K speech instructions and corresponding speech responses. Experimental results show that compared to previous speech-language models, LLaMA-Omni provides better responses in both content and style, with a response latency as low as 226ms. Additionally, training LLaMA-Omni takes less than 3 days on just 4 GPUs, paving the way for the efficient development of speech-language models in the future.

Large language models (LLMs), represented by ChatGPT (OpenAI, 2022), have become powerful general-purpose task solvers, capable of assisting people in daily life through conversational interactions. However, most LLMs currently only support text-based interactions, which limits their application in scenarios where text input and output are not ideal. Recently, the emergence of GPT-4o (OpenAI, 2024) has made it possible to interact with LLMs through speech, responding to users' instructions with extremely low latency and significantly enhancing the user experience.
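Low first-packet latency of this kind is typically achieved with streaming synthesis: audio is vocoded and played chunk by chunk as speech units arrive, so the first sound reaches the user after only the first chunk rather than after the full response. A hedged sketch of that general pattern follows; `unit_generator`, `vocoder`, and `play` are hypothetical stand-ins, not LLaMA-Omni's actual components.

```python
def stream_response(unit_generator, vocoder, play, chunk=40):
    """Vocode and play discrete speech units in small chunks as they are produced."""
    buf = []
    for unit in unit_generator:      # speech units emitted alongside the text response
        buf.append(unit)
        if len(buf) == chunk:        # a fraction of a second of audio, depending on unit rate
            play(vocoder(buf))       # synthesize and play this chunk immediately
            buf = []
    if buf:                          # flush whatever is left at the end
        play(vocoder(buf))
```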
- South America > Chile > Santiago Metropolitan Region > Santiago Province > Santiago (0.04)
- Asia > Thailand > Bangkok > Bangkok (0.04)
- North America > United States > New York > New York County > New York City (0.04)
- (4 more...)
How IBM Is Employing AI To Predict Alzheimer's Disease
IBM researchers then used NLP to analyse the participants' language sample transcripts. The model picked up tiny subtleties and changes in discourse that are generally missed when the analysis is done manually. Based on this, IBM researchers trained the ML model to account for multiple variables affecting the results. Lastly, they drew on data from the subjects at the Framingham Heart Study, where participants are assessed through two-minute Mini-Mental State Examination speech tests every four years and neuropsychological exams every year.

Figure: CTT examples from FHS, including an unimpaired sample (a), an impaired sample showing telegraphic speech and lack of punctuation (b), and an even more impaired sample additionally showing significant misspellings and minimal grammatical complexity.
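The general recipe described here (extract linguistic markers such as telegraphic phrasing, missing punctuation, and reduced vocabulary from transcripts, then fit a model that predicts later impairment) can be sketched roughly as follows. This is a simplified illustration with hand-picked features, not IBM's actual model.

```python
from sklearn.linear_model import LogisticRegression

def transcript_features(text: str) -> list[float]:
    """A few simple linguistic markers of the kind mentioned above."""
    words = text.split()
    sentences = [s for s in text.split(".") if s.strip()]
    vocab_richness = len(set(words)) / max(len(words), 1)       # repetitive wording lowers this
    avg_sentence_len = len(words) / max(len(sentences), 1)      # telegraphic speech shortens this
    punctuation_rate = sum(c in ".,;:" for c in text) / max(len(words), 1)
    return [vocab_richness, avg_sentence_len, punctuation_rate]

def fit_classifier(transcripts: list[str], impaired_labels: list[int]) -> LogisticRegression:
    """Fit a classifier on transcripts collected years before diagnosis."""
    X = [transcript_features(t) for t in transcripts]
    return LogisticRegression().fit(X, impaired_labels)
```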
- North America > Canada > Quebec > Montreal (0.06)
- Asia > Japan > Honshū > Kantō > Ibaraki Prefecture > Tsukuba (0.06)
Daily chats with AI could help spot early signs of Alzheimer's
But the earlier it's diagnosed, the more chances there are to delay its progression. Our joint team of researchers from IBM and the University of Tsukuba has developed an AI model that could help detect the onset of mild cognitive impairment (MCI), the transitional stage between normal aging and dementia, by asking older people typical daily questions. In a new paper published in the journal Frontiers in Digital Health, we present the first empirical evidence of tablet-based automatic assessments of patients using speech analysis, successfully detecting MCI. Unlike previous studies, our AI-based model uses speech responses to daily life questions collected with a smartphone or tablet app. Such questions could be as simple as asking someone about their mood, plans for the day, physical condition or yesterday's dinner. Earlier studies mostly focused on analyzing speech responses during cognitive tests, such as asking a patient to "count down from 925 by threes" or "describe this picture in as much detail as possible."
- Health & Medicine > Therapeutic Area > Neurology > Alzheimer's Disease (0.59)
- Health & Medicine > Therapeutic Area > Neurology > Dementia (0.49)